Introduction


The US Army MRMC is interested in applying Spatial Paradigm for Information Retrieval and
Exploration (SPIRE) technology to data sets of trauma-related information. SPIRE has generally
been designed to work with larger data sets of a more unstructured nature (e.g. newswatch data,
message traffic). The purpose of this research was to determine whether SPIRE would produce a
meaningful analysis of this trauma data. The trauma data, provided by Dr. Howard Champion of
the University of Maryland, has been processed and analyzed. The data consisted of 86
documents from two different sources pertaining to auto accidents. Both sources of documents
discuss traumas; however, one source focused more on the accidents, while the other
focused more on the medical problems. SPIRE identified the major differences between the two
data sources but, according to our analysis, could find little other meaningful differentiation
among the documents. From our initial analysis, it appears that the low number of documents and
the structured nature of those documents impaired the performance of SPIRE on this data set.


Task  Description  and  Study  Results 

SPIRE technology, as it exists today, assesses the patterns of word use in documents (e.g.
frequency clusters of an individual word imply thematic salience) to determine important topics
and relationships. Natural language communication uses definite strategies for conveying
content, particularly when substantial prior knowledge is not assumed, and SPIRE depends on
these strategies. Because of the inherent structure in the trauma data, the SPIRE system tends to
identify most documents in this data set as very similar. As a consequence, thematic
differentiation of this data set is less pronounced than we would normally consider desirable.

Several different analyses were generated to assess the nature of this document set. First, we
examined the correlation matrix (which establishes the correlations between all major terms and
the key topics), and found values consistently lower than we normally consider acceptable. A
subset of the matrix can be found in Appendix - Figure 1. A random sample of terms is provided to
convey the nature of the correlation matrix content. The term at the top of each column is the term
to which the other terms are related. The values are normalized, and in a more strongly related set
of documents they may average 0.6 or higher. As the sample conveys, the average for the trauma
data is about 0.15 for the ten most related terms. Additionally, the number of connected terms
(terms with non-zero values) is typically much greater in a "normal" data set, which again
demonstrates the structured nature of this data. Generally, the information found in this matrix
indicates that the documents do not tend to group into easily differentiated clusters, and that
relationships which would otherwise be small enough to ignore can have a dominant influence on
thematic distribution.
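
The "average for the ten most related terms" can be illustrated with a small sketch. This is not
SPIRE's implementation, which is not described here: we assume, purely for illustration, that
term relatedness is a Pearson correlation of per-document term counts. The function names and
the toy counts are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length count vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def mean_top_correlation(term_counts, term, k=10):
    """Average of the k highest correlations between `term` and every other term."""
    target = term_counts[term]
    scores = sorted(
        (pearson(target, counts) for other, counts in term_counts.items() if other != term),
        reverse=True,
    )
    top = scores[:k]
    return sum(top) / len(top) if top else 0.0


# Hypothetical counts of four terms across five documents (not taken from the trauma data).
counts = {
    "tibia": [3, 0, 2, 0, 1],
    "fracture": [2, 0, 2, 0, 1],
    "vehicle": [1, 1, 1, 1, 1],
    "speed": [0, 2, 0, 1, 1],
}
average = mean_top_correlation(counts, "tibia", k=2)
print(f"mean of the 2 highest correlations for 'tibia': {average:.2f}")
```

Note that a term appearing equally often in every document (here "vehicle") correlates with
nothing under this assumed measure, which is one way uniformly structured documents can pull
such averages down.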

A second analysis we performed was to identify the number of statistically important topic terms
in the data set relative to the unique word count. This is commonly called the noise-to-signal
ratio, the “signal” being the important topics and the rest of the vocabulary being the “noise”.
There were only 13 major topics found in the data set, against 2806 words in the vocabulary. This
ratio (about 0.005, or roughly 0.5%) is extremely low for significant analysis. Significantly fewer
high-value topical terms were found in this data set than are found in more normal expressions of
natural language communication such as news articles, research papers, or even WWW pages. For
example, a data set run on CNN news data (635 small documents) gave approximately 450 major
terms and 10,000 words in the vocabulary, a ratio of roughly 4.5%. The trauma data set therefore
has a high noise-to-signal ratio, making it difficult to find meaningful information.
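
The two ratios can be checked with a few lines of arithmetic. The sketch below uses only the
counts quoted in this section; the function name `topic_ratio` is ours and is not part of SPIRE.

```python
def topic_ratio(major_topics: int, vocabulary_size: int) -> float:
    """Fraction of the vocabulary that consists of major topic terms."""
    if vocabulary_size <= 0:
        raise ValueError("vocabulary_size must be positive")
    return major_topics / vocabulary_size


trauma = topic_ratio(13, 2806)   # trauma data set
cnn = topic_ratio(450, 10_000)   # CNN news data set (approximate counts)

print(f"trauma: {trauma:.4f} ({trauma:.2%})")
print(f"CNN:    {cnn:.4f} ({cnn:.2%})")
print(f"CNN / trauma: {cnn / trauma:.1f}x")
```

On these counts the CNN data carries roughly ten times as many topic terms per vocabulary word
as the trauma data.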


Appendix - Figure 2 is another illustration of the lack of discriminating dimensions in the data
set. The graph covers three documents which appear in very different locations in the 2D
scatter plot shown in Appendix - Figure 3: two are proximal, and the third is distant from the
first two. For each document, the thirteen topics are graphed along with their relative
magnitudes. Typically, we would expect to be able to quickly identify proximal and distant
document pairs from the strong diversity in content represented by the magnitudes of each
dimension: the relative magnitudes at each dimension would be close for proximal documents,
while the values for distant documents would be quite different. A couple of dimensions (topics)
in this data set, “pilon” and “travel”, follow this pattern. However, the other topics show more
random values for both proximal and distant pairs. This again shows that the data is not rich in
discernible content, at least as measured by SPIRE.
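
The pattern we look for in the topic vectors can be stated as a simple test. The sketch below is
an illustration only: the magnitudes are invented, "topic_3" and "topic_4" are placeholder names,
and the `margin` threshold is an arbitrary assumption rather than a SPIRE parameter.

```python
def discriminating_topics(near_a, near_b, far, margin=0.2):
    """Topics whose magnitudes agree for the proximal pair but differ for the distant document.

    A topic qualifies when the gap to the distant document exceeds the gap
    between the two proximal documents by at least `margin`.
    """
    selected = []
    for topic in near_a:
        near_gap = abs(near_a[topic] - near_b[topic])
        far_gap = min(abs(near_a[topic] - far[topic]), abs(near_b[topic] - far[topic]))
        if far_gap - near_gap >= margin:
            selected.append(topic)
    return selected


# Invented topic magnitudes for two proximal documents and one distant document.
doc_a = {"pilon": 0.80, "travel": 0.10, "topic_3": 0.40, "topic_4": 0.55}
doc_b = {"pilon": 0.75, "travel": 0.15, "topic_3": 0.70, "topic_4": 0.20}
doc_c = {"pilon": 0.10, "travel": 0.90, "topic_3": 0.50, "topic_4": 0.45}

useful = discriminating_topics(doc_a, doc_b, doc_c)
print(useful)
```

In a data set with rich content we would expect most topics to pass such a test; in the invented
example only two of the four do, mirroring the behaviour described above.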

There were, however, some positive results that are worthy of note. We were able to identify
thematic clusters which could provide some insights to an analyst, depending upon their
domain-specific requirements. For example, a query on the word "tibia" shows that all tibial
fractures tend to group together in the lower middle quadrant of the themescape, as shown in
Appendix - Figures 4 and 5. Further exploration with a larger data set might enable us to discover
correlations with such elements as vehicle type or speed, age of "case," and so on.


Conclusions 

In summary, we have determined that this particular data set has a degree of structure which
makes it difficult for the SPIRE system to process meaningfully. Each document discusses the
same general topics in the same general language; therefore, the language used does not convey
the importance of words in the manner to which SPIRE is tuned. There is, however, information
in the structure itself that has potential. Our domain expert, Dr. Howard Champion, agrees with
this analysis and believes that SPIRE accurately portrayed the information in the trauma data.

At least two short-term steps might be taken to improve the quality of the results. First, acquiring
a larger data set might yield a better correlation matrix in terms of strength and number of
word/word correlations, although we are skeptical that more data of exactly the same type would
produce appreciably better results. Second, it might be possible to separate or eliminate some of
the non-injury-related vocabulary in the belief that the remaining data will map out more
meaningfully. This effort would eliminate some of the “noise” in this signal.

Further research into this type of data, and into how to process and visualize it in a meaningful
way, is required for more substantive progress. The field of visual analysis of structured data is
new and innovative. Research would include understanding the structure and gaining knowledge
from the structure. It would also include new ways of visualizing structured data that might
combine current data mining techniques with new visualization techniques such as SPIRE. SPIRE
could be expanded to support this area in conjunction with its current architecture for dealing
with unstructured text.

Appendix - Figures and Charts

Figure 1. Correlation Matrix


The  following  matrices  show  a  random  sample  of  12  terms  and  the  normalized  correlation  value  of  10 
related  terms.  Strongly  related  document  sets  average  0.6.  Trauma  data  average  is  approximately  0.15. 

Figure 2. Sample Trauma Vectors